Geospatial Data Analysis in R

Vector Data Sources

Vector data sources represent geographic features as geometric shapes. Different kinds of shapes are used to represent different kinds of features:

  • Points are used to represent features as a simple location such as an observation site.
  • Line segments are used to represent curvilinear features such as roads and rivers.
  • Polygons are used to represent regions on Earth’s surface such as countries and lakes.

In addition to geometry, shapes usually have a variety of associated data called attributes. For example a point representing a United States address would probably have street, city, state, and zip code attributes in addition to its geographic coordinates.

Vector File Formats

Comma-separated Values

Probably the most common vector format is text-based XYZ point coordinates. In this format each line corresponds to one point, usually with each coordinate separated by a comma. This format is called comma-separated values (CSV) or, more generally, delimited text.

x1,y1,z1
x2,y2,z2
x3,y3,z3

The Z coordinate is sometimes omitted, for example when points are identified only by longitude and latitude. Very often the Z coordinate is re-purposed to record information about the point, such as a measurement that occurred there.

An advantage of this format is that it is human readable and highly portable. Virtually all computers and software can read and write it. It is also easily manipulated using UNIX shell utilities. A disadvantage is that it is inefficient for large data sets and can only represent point geometry. Another disadvantage is its very limited support for metadata, which is typically limited to column names.

Shapefiles

Another very common vector format is ESRI Shapefile. This open source format originated in the 1990s with ArcView GIS. Actually the term shapefile is a misnomer: a shapefile is a collection of several files:

  • .shp – contains feature geometry
  • .shx – look-up file to accelerate finding individual features
  • .dbf – feature attribute database
  • .prj – projection data (optional)

Shapefiles are capable of storing points, lines, and polygons, and up to 255 attributes per feature. Attribute names are limited to 10 characters. Shapefiles have good support for metadata.

Shapefiles are generally not human readable but they are portable and well supported by GIS software, making them a good choice for storing vector data. One limitation is that each shapefile can only contain one kind of geometry: points can’t be stored with lines, for example, so you need separate files for each geometry type. In addition, line and polygons are internally stored as collections of points instead of mathematical curves (splines), which can result in large files.

Loading Shapefiles in R

Shapefiles are opened in R with the rgdal package.

library(rgdal)
## Loading required package: sp
## rgdal: version: 1.4-3, (SVN revision 828)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.1.3, released 2017/20/01
##  Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/gdal
##  GDAL binary built with GEOS: FALSE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/proj
##  Linking to sp version: 1.3-1
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Data Sources and Layers

sf routines for opening and inspecting shapefiles don’t directly accept a shapefile path argument. Instead they accept a data source name (dsn) and layer name. The data source name is the directory containing the shapefile. The layer name is the name of the shapefile without the file extension (.shp).

You can query a data source for a list of available layers with the st_layers function.

st_layers(dsn="ncshape")
## Driver: ESRI Shapefile 
## Available layers:
##               layer_name  geometry_type features fields
## 1        boundary_county     3D Polygon      926     25
## 2        boundary_municp     3D Polygon     3579     27
## 3                bridges       3D Point    10938     14
## 4              busroute1 3D Line String        3      2
## 5             busroute11 3D Line String        1      2
## 6              busroute6 3D Line String        2      2
## 7             busroute_a 3D Line String        8      2
## 8           busroutesall 3D Line String      100      2
## 9            busstopsall       3D Point      118     22
## 10       census_wake2000     3D Polygon      105     13
## 11      censusblk_swwake     3D Polygon     2557     68
## 12         comm_colleges       3D Point       58     14
## 13     elev_lid792_bepts       3D Point   118716      1
## 14    elev_lid792_cont1m 3D Line String       73      2
## 15   elev_lid792_randpts       3D Point     6000      2
## 16   elev_lidrural_mrpts       3D Point    36944      4
## 17 elev_lidrural_mrptsft       3D Point    36944      4
## 18   elev_ned10m_cont10m 3D Line String     1008      2
## 19          firestations       3D Point       71     22
## 20          geodetic_pts       3D Point    29939     13
## 21   geodetic_swwake_pts       3D Point      263     13
## 22               geology     3D Polygon     1832      8
## 23           geonames_NC       3D Point    48557     18
## 24         geonames_wake       3D Point     1088     18
## 25             hospitals       3D Point      160     16
## 26                 lakes     3D Polygon    15279      8
## 27              nc_state     3D Polygon        8      3
## 28            overpasses       3D Point    17754      7
## 29      planimetry_rural 3D Line String     1000      4
## 30    planimetry_rural3d 3D Line String     1000      4
## 31        poi_names_wake       3D Point     1090     18
## 32     precip_30ynormals       3D Point      136     18
## 33  precip_30ynormals_3d       3D Point      136     18
## 34             railroads 3D Line String    10837     12
## 35            roadsmajor 3D Line String      355      7
## 36          schools_wake       3D Point      167     29
## 37         soils_general     3D Polygon     1428      7
## 38            soils_wake     3D Polygon     3446      7
## 39               streams 3D Line String     8554     14
## 40          streets_wake 3D Line String    49746     35
## 41            swwake_10m     3D Polygon        1      1
## 42             urbanarea     3D Polygon      675      5
## 43             usgsgages       3D Point      433     14
## 44         zipcodes_wake     3D Polygon       48     12

Layer Details

To obtain detailed information about a layer, call ogrInfo from rgdal with the data source and layer name.

ogrInfo(dsn = "ncshape", layer = "firestations")
## Source: "/Users/tkeitt/Dropbox/R/keitt.ssi.2019/inst/materials/exercises/ncshape", layer: "firestations"
## Driver: ESRI Shapefile; number of rows: 71 
## Feature type: wkbPoint with 3 dimensions
## Extent: (615797.7 200696.3) - (671786.6 253490.8)
## CRS: +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +datum=NAD83 +units=m +no_defs  
## LDID: 0 
## Number of fields: 22 
##          name type length  typeName
## 1         cat   12     11 Integer64
## 2          ID   12     11 Integer64
## 3       LABEL    4     80    String
## 4    LOCATION    4     80    String
## 5        CITY    4     80    String
## 6   MUN_COUNT    4     80    String
## 7     PUMPERS   12     11 Integer64
## 8  PUMPER_TAN   12     11 Integer64
## 9      TANKER   12     11 Integer64
## 10 MINI_PUMPE    2     24      Real
## 11 RESCUE_SER   12     11 Integer64
## 12     AERIAL   12     11 Integer64
## 13      BRUSH   12     11 Integer64
## 14     OTHERS   12     11 Integer64
## 15 WATER_RESC   12     11 Integer64
## 16    MUNCOID   12     11 Integer64
## 17   BLDGCODE    4     80    String
## 18     AGENCY    4     80    String
## 19  STATIONID    4     80    String
## 20      RECNO    2     24      Real
## 21    CV_SID2    4     80    String
## 22      CVLAG    2     24      Real

The firestations layer contains locations and attributes of fire stations in Wake County, NC. The “Feature type” line indicates that this layer has point geometry. The “Driver” line shows that the layer contains 71 points. The “Fields” table shows that each point has 22 attributes.

Opening Shapefiles

Shapefiles are opened in R with st_read.

firestations <- st_read(dsn = "ncshape", layer = "firestations")
## Reading layer `firestations' from data source `/Users/tkeitt/Dropbox/R/keitt.ssi.2019/inst/materials/exercises/ncshape' using driver `ESRI Shapefile'
## Simple feature collection with 71 features and 22 fields
## geometry type:  POINT
## dimension:      XYZ
## bbox:           xmin: 615797.7 ymin: 200696.3 xmax: 671786.6 ymax: 253490.8
## epsg (SRID):    NA
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +datum=NAD83 +units=m +no_defs

The type of object returned by st_read depends on the layer geometry. It will contain POINTS, POLYGONS, MULTIPOLYGONS, etc. For example, we can determine the number of fire stations in each Wake County city. This example uses dplyr to count the number of firestations in each city.

firestations %>% group_by(CITY) %>% summarize(count = n()) -> num_fs
print(num_fs)
## Simple feature collection with 12 features and 2 fields
## geometry type:  GEOMETRY
## dimension:      XYZ
## bbox:           xmin: 615797.7 ymin: 200696.3 xmax: 671786.6 ymax: 253490.8
## epsg (SRID):    NA
## proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +datum=NAD83 +units=m +no_defs
## # A tibble: 12 x 3
##    CITY        count                                               geometry
##    <fct>       <int>                                         <GEOMETRY [m]>
##  1 Apex            4 MULTIPOINT Z (615797.7 213364 0, 622136.9 220306.4 0,…
##  2 Cary            7 MULTIPOINT Z (622996.3 226225.3 0, 627041.5 224084.3 …
##  3 Fuquay-Var…     3 MULTIPOINT Z (627926.3 203208.6 0, 634666.7 207474.5 …
##  4 Garner          2  MULTIPOINT Z (642560 215531.8 0, 644588.5 217450.6 0)
##  5 Holly Spri…     2  MULTIPOINT Z (623658.6 209478.2 0, 625889.9 211478 0)
##  6 Knightdale      3 MULTIPOINT Z (655179.7 222274.6 0, 656750.5 226365 0,…
##  7 Morrisville     3 MULTIPOINT Z (620857 230066.4 0, 624469.1 232478.1 0,…
##  8 Raleigh        37 MULTIPOINT Z (628006.3 238736.1 0, 628813.8 236394.5 …
##  9 Rolesville      1                          POINT Z (658578.9 241140.1 0)
## 10 Wake Forest     5 MULTIPOINT Z (641826.3 253490.8 0, 644714 247773 0, 6…
## 11 Wendell         2    MULTIPOINT Z (660134 232499.9 0, 665826 226513.6 0)
## 12 Zebulon         2  MULTIPOINT Z (667866.4 237035 0, 671786.6 230059.2 0)

Notice that the resulting data frame still has the geometries. We can plot this with ggplot.

ggplot(num_fs, aes(x = count, y = CITY)) + geom_point()

Activity: Wake County Schools

In this activity we will plot the number of elementary schools, middle schools, and high schools in Wake County cities. Open the schools_wake layer in the ncshape data source and create a table of schools in cities. Coerce the table to a data frame with column names city, level, and count, and use subset to obtain those rows with nonzero count and level “E”, “M”, or “H”. Create dotplots and barcharts of the data, experimenting with conditioning and grouping. Which plots are most informative?

Functions: sf::st_read, dplyr::count, dplyr::filter, ggplot2::ggplot, ggplot2::geom_point, ggplot2::facet_wrap

Plotting Vector Data

Plotting Shapes

The spatial data frames returned by st_read define methods for ggplot. We can subset the polygons in the boundary_county layer to plot an outline of Wake County:

boundary.county <- st_read("ncshape", "boundary_county", quiet = TRUE)
wake <- subset(boundary.county, NAME=="WAKE")
p1 <- ggplot(wake) + geom_sf() + theme_bw()
print(p1)

The firestations point layer can be added to the plot by simply appending a new layer and printing.

p1 <- p1 + geom_sf(data = firestations)
print(p1)

Similarly, the roadsmajor line layer can be added by appending a new ggplot layer:

roadsmajor <- st_read("ncshape", "roadsmajor", quiet = TRUE)
p1 <- p1 + geom_sf(data = roadsmajor)
print(p1)

Activity: Basic Plotting

Use the nc_state, hospitals, and railroads layers of the ncshape data source to create a map of North Carolina hospitals and railroads. Note that there are times when geom_sf is very slow. I have filled a bug report. It may take 5 minutes or more to actually draw the plot.

Functions: st_read, ggplot, geom_sf

Plotting Shapes with Attributes

While some maps are designed only to display shape and location, others are designed to highlight a theme connected to a geographic area. Such maps are called thematic maps, and they can be plotted in R using ggplot. Here we color Wake County zipcodes with a unique color for each city.

zipcodes.wake <- st_read(dsn = "ncshape", layer = "zipcodes_wake", quiet = T)
ggplot(zipcodes.wake) +
  geom_sf(aes(fill = NAME))

Map Projections

A map projection is a planar representation of Earth’s curved surface. While it is usually sufficient to treat Earth’s surface as flat over small geographic areas, over large areas the distortions introduced by flattening a curved surface become significant. Projections distort areas, shapes, distances, directions, and other properties of Earth’s surface. A map projection represents a particular trade-off among these distortions. Some preserve area at the expense of shape (equal-area projections), while others preserve shape at the expense of area (conformal projections). It is impossible to preserve both.

For these reasons, there is no single best projection. A projection must be chosen to suit the particular application, so very often when combining geospatial data from different sources you will find that different projections were used. In this case it is necessary to transform one projection to another in a process called reprojection. Since vector data sources are based on mathematical objects, reprojection is straightforward.

We can obtain shapefile projection details with the st_crs function:

countries <- st_read(dsn = "extra", layer = "ne_110m_admin_0_countries", quiet = T)
st_crs(countries)
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
ggplot(countries) +
  geom_sf(aes(fill = subregion))

Reprojection is done with st_transform:

countries <- st_transform(countries, st_crs("+proj=moll"))
ggplot(countries) +
  geom_sf(aes(fill = subregion))

Calculations on Vector Data

One of the greatest advantages of vector data sources is their suitability for geometric operations. Here we will highlight several useful operations provided by the sf package. These operate on and return sf or sfc objects. While they accept the spatial data frames returned from st_read as inputs, the attributes are discarded and only the geometries are returned.

Union

sf offers st_union for merging intersecting geometries. The boundary.county layer contains 926 polygons representing portions of 100 counties. We can use st_union in combination with dplyr (filter, group_by, summarize) to merge the polygons of individual counties, producing a sf object with one polygon per county.

counties <- boundary.county %>%
  filter(!is.na(NAME)) %>%
  group_by(NAME) %>%
  summarize(geometry = st_union(geometry))
ggplot(counties) +
  geom_sf(aes(fill = NAME)) +
  guides(fill = FALSE)

Activity: Merge Counties into Regions

In this activity we will group North Carolina counties longitudinally into four regions and merge each region into a single polygon. One approach is to use the st_coordinates function to get a vector of county centers. This vector of x coordinates can be converted to a vector of factor levels using cut. After a new sf is created from the factors, st_union can merge the polygons of each region.

Functions: st_centroid, st_coordinates, cut, group_by, summarize, st_union

## Warning in st_centroid.sf(counties): st_centroid assumes attributes are
## constant over geometries of x

Centroid

The st_centroid routine calculates the centroid of the given geometry.

county.centers <- st_centroid(counties)
## Warning in st_centroid.sf(counties): st_centroid assumes attributes are
## constant over geometries of x
ggplot(counties) +
  geom_sf(aes(fill = NAME)) +
  geom_sf(data = county.centers) +
  guides(fill = FALSE)

Buffer

Often the space surrounding a feature is of as much interest as the feature itself, particularly when intersecting overlaying points and lines on another dataset. We can use the gBuffer function to expand a geometry to include the area within a specified width. As with gCentroid, the default behavior is to create a buffer around the entire geometry, but by passing byid=TRUE we can calculate a buffer for individual features.

ggplot(nc_state) +
  geom_sf() +
  geom_sf(data = hospitals, color = "steelblue")

Now use st_buffer to create a 10 mile buffer around each hospital:

hospitals.buffer <- st_buffer(hospitals, 16000) # 16km is approx 10mi

Activity: Wake County Firestation Coverage

In this activity we will find the proportion of Wake County within 5km of a firestation using st_buffer, st_distance, and st_area.

Functions: st_buffer, st_combine, ggplot, st_area, st_difference

## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
## 0.9034588 [1]

Overlay

In sf it is extremely easy to join tables. sf overrides the [] indexing function when indexing with another sf table. Here is an example of joining two tables. It adds the columns from the boundary.county table to the hospitals table. We can use that to give a county name to each hosptial. We extract only the NAME column from the boundary.county table in this example.

hospitals.joined <- hospitals[boundary.county$NAME, ]
table(hospitals.joined$COUNTY)
## 
##     Alamance    Alexander    Alleghany        Anson         Ashe 
##            1            0            0            0            0 
##        Avery     Beaufort       Bertie       Bladen    Brunswick 
##            0            2            1            0            1 
##     Buncombe        Burke     Cabarrus     Caldwell     Carteret 
##            3            3            1           10            1 
##      Catawba      Chatham     Cherokee       Chowan    Cleveland 
##           34            2            1            0           28 
##     Columbus       Craven   Cumberland         Dare     Davidson 
##            1           32            3            1            1 
##        Davie       Duplin       Durham    Edgecombe      Forsyth 
##            1            1            3            1           39 
##     Franklin       Gaston    Granville     Guilford      Halifax 
##           45            1            2           21            1 
##      Harnett      Haywood    Henderson     Hertford         Hoke 
##            1            1            1            0            1 
##      Iredell      Jackson     Johnston          Lee       Lenoir 
##            1            1            2            1            1 
##      Lincoln        Macon       Martin     McDowell  Mecklenburg 
##            0            1            1            0            7 
##     Mitchell   Montgomery        Moore         Nash  New Hanover 
##            1            1           14            7            0 
##       Onslow       Orange   Pasquotank       Pender       Person 
##           60            1            0            2            1 
##         Pitt         Polk     Randolph     Richmond      Robeson 
##            0            0            1            2            0 
##   Rockingham        Rowan   Rutherford      Sampson     Scotland 
##            1            2            1           39            2 
##       Stanly       Stokes        Surry        Swain Transylvania 
##            1            0            1            2            1 
##        Union        Vance         Wake   Washington      Watauga 
##            0            0          186            0            1 
##        Wayne       Wilkes       Wilson       Yadkin       Yancey 
##            2            0            0            1            0

Note the presence of zeros in the table: the inner join operation returned the NAME attribute of every county in the layer, including those without a hospital. We can use the table to create a thematic map.

Activity: North Carolina Hospitals per County

Compute and plot the number of hospitals in each county in North Carolina. Hint: you can do an inner join between the boundary.county table and the hosptials table to get a separate row for each hospital-county combination.

Functions: st_join, group_by, summarize, n, st_simplify, ggplot, geom_sf

Spatial Interpolation

Spatial interpolation is the process of using the values of a continuous variable (e.g. temperature) at a set of sample points to estimate the value at every other point in a region of interest. The basic idea is to use a weighted average of the values at observed (sampled) points to estimate the values at unobserved points. All spatial interpolation methods are based on Tobler’s First Law of Geography:

Everything is related to everything else, but near things are more related than distant things.

We will demonstrate spatial interpolation with the gstat package and Meuse data set.

library(sf)
library(sp)
library(stars)
## Loading required package: abind
library(gstat)
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
library(ggplot2)
data(meuse)
data(meuse.grid)
data(meuse.riv)

The first step is converting the data set to sp classes.

coordinates(meuse) <- ~x+y
proj4string(meuse) <- CRS("+init=epsg:28992")
coordinates(meuse.grid) <- ~x+y
proj4string(meuse.grid) <- CRS("+init=epsg:28992")
meuse.river <- SpatialPolygons(list(Polygons(list(Polygon(meuse.riv)), "meuse.riv")))
proj4string(meuse.river) <- CRS("+init=epsg:28992")

Now we convert to sf

meuse <- st_as_sf(meuse)
meuse.grid <- st_as_sf(meuse.grid)
meuse.river <- st_as_sf(meuse.river)

We can visualize the data set as a whole.

ggplot(meuse) +
  geom_sf(data = meuse.river) +
  geom_sf(aes(color = zinc, size = zinc))

Inverse Distance Weighting

There are many methods of spatially interpolating point data. A simple algorithm is inverse distance weighting where each interpolated value is a weighted sum of the values of its neighbors. The weight used is the inverse of the distance to all or a subset of the original points. To obtain an inverse distance interpolation, we use the krige function from the gstat package, but without specifying a fitted model.

zinc.nvd <- krige(zinc ~ 1, locations = meuse, newdata = meuse.grid)
ggplot(zinc.nvd) +
  geom_sf(aes(color = var1.pred))

Variograms

A variogram quantifies the autocorrelation of a set of points. The x-axis of a variogram is lag-distance between points. The y-axis gives the average squared difference between the z-values of points for a particular lag-distance. When the variogram is flat, there is no autocorrelation. When it rises steeply, there is autocorrelation. In geostatistics, we often fit a function to the variogram and then use that function to determine the contribution of neighbors to our interpolation. If the variogram flattens out at a very short distance, then we increase the contribution of nearby neighbors and the interpolation surface will be rough. If the variogram flattens out at a very long distance, we generate a more smooth interpolation of the point.

zinc.ev <- variogram(log(zinc)~1, meuse)
zinc.fitted <- fit.variogram(zinc.ev, model=vgm(1, "Sph", 900, 1))
plot(zinc.ev, zinc.fitted)

Ordinary Kriging

Ordinary Kriging assumes that the data are stationary, i.e., there is no spatial trend in the mean z-value. This is indicated below by the model formula only containing an intercept term (~1). A “Universal Kriging” model would contain formula terms related to the coordinates. This allows one to account for spatial trends in the mean value.

zinc.kriged <- krige(log(zinc)~1, meuse, meuse.grid, model=zinc.fitted)
ggplot(zinc.kriged) +
  geom_sf(aes(color = var1.pred))

Activity: Interpolate Annual Precipitation

Interpolate the annual rainfall data in precip_30ynormals layer.

Functions: st_read, st_sample, krige, ggplot, variogram, fit.variogram

## Warning in fit.variogram(precip.ev, model = vgm(1, "Sph", 50000, 1)): No
## convergence after 200 iterations: try different initial values?